Acoustic Scene Recognition with Deep Learning
Abstract
Background. Sound complements visual input and is an important modality for perceiving the environment. Machines in many settings, such as smartphones, autonomous robots, and security systems, increasingly have the ability to hear. This work applies state-of-the-art Deep Learning models, which have revolutionized speech recognition, to understanding general environmental sounds. Aim. This work aims to classify 15 common indoor and outdoor locations from their environmental sounds, comparing conventional and Deep Learning models on this task. Data. We use a dataset from the ongoing IEEE challenge on Detection and Classification of Acoustic Scenes and Events (DCASE). The dataset covers 15 diverse indoor and outdoor locations, such as buses, cafes, cars, city centers, forest paths, libraries, and trains, totaling 9.75 hours of audio recordings. Methods. We extract features using signal processing techniques such as mel-frequency cepstral coefficients (MFCCs), various statistical functionals, and spectrograms, yielding 4 feature sets: MFCCs (60-dimensional), Smile983 (983-dimensional), Smile6k (6573-dimensional), and spectrograms (used only for CNN-based models). On these features we apply seven models: Gaussian Mixture Models (GMMs), Support Vector Machines (SVMs), Deep Neural Networks (DNNs), Recurrent Neural Networks (RNNs), Recurrent Deep Neural Networks (RDNNs), Convolutional Neural Networks (CNNs), and Recurrent Convolutional Neural Networks (RCNNs). GMMs and SVMs are popular conventional models for this task, while RDNNs, CNNs, and RCNNs are, to our knowledge, applied here to environmental sound for the first time. Results. Our experiments show that model performance varies with the features used. With the small feature sets (MFCCs and Smile983), temporal models (RNNs, RDNNs) outperform non-temporal models (GMMs, SVMs, DNNs).
However, with the large feature set (Smile6k), DNNs outperform the temporal models (RNNs and RDNNs) and achieve the best performance among all studied methods. The GMM with MFCC features, the baseline model provided by the DCASE contest, achieves 67.6% test accuracy, while the best-performing model (a DNN on the Smile6k features) reaches 80% test accuracy. RNNs and RDNNs generally perform in the range of 68% to 77%, while SVMs vary between 56% and 73%. CNNs and RCNNs with spectrogram features lag behind the other Deep Learning models, reaching 63% to 64% accuracy. Conclusions. We find that Deep Learning models compare favorably to conventional models (GMMs and SVMs). No single model outperforms all others across all feature sets, showing that model performance varies significantly with the feature representation. The fact that the best-performing model is a non-temporal DNN suggests that environmental sounds do not exhibit strong temporal dynamics, consistent with our day-to-day experience that environmental sounds tend to be random and unpredictable.
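As a concrete illustration of the frame-level features described in the abstract, the following is a minimal pure-NumPy sketch of a 60-dimensional MFCC pipeline: 20 cepstral coefficients per frame with delta and delta-delta features appended. The window length, hop size, and filterbank parameters here are illustrative assumptions, not the exact settings used in the paper's experiments.

```python
import numpy as np

def mfcc_features(signal, sr=44100, n_fft=2048, hop=1024, n_mels=40, n_mfcc=20):
    """MFCCs + deltas + delta-deltas for a mono signal: (n_frames, 3 * n_mfcc)."""
    # Slice the signal into overlapping Hann-windowed frames.
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Triangular mel filterbank spanning 0 .. sr/2.
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    # Log mel energies, then DCT-II to decorrelate; keep the first n_mfcc coefficients.
    log_mel = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), (2 * n + 1) / (2.0 * n_mels)))
    mfcc = log_mel @ dct.T
    # Append deltas and delta-deltas to get the 60-dimensional frame vector.
    delta = np.gradient(mfcc, axis=0)
    return np.hstack([mfcc, delta, np.gradient(delta, axis=0)])
```

Per-frame vectors like these can be modeled directly by a frame-level classifier such as the GMM baseline, or summarized over a whole clip by statistical functionals as in the Smile feature sets.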
Similar resources
Combining pattern recognition and deep-learning-based algorithms to automatically detect commercial quadcopters using audio signals (Research Article)
Commercial quadcopters, which have many private, commercial, and public sector applications, are a rapidly advancing technology. Currently, no mechanism guarantees the safe operation of these devices in the community. Three different automatic commercial quadcopter identification methods are presented in this paper. Among these three techniques, two are based on deep neural networks in whi...
Deep Neural Network Bottleneck Feature for Acoustic Scene Classification
Bottleneck features have been shown to be effective in improving the accuracy of speaker recognition, language identification and automatic speech recognition. However, few works have focused on bottleneck features for acoustic scene classification. This report proposes a novel acoustic scene feature extraction using bottleneck features derived from a Deep Neural Network (DNN). On the official ...
Open-Domain Audio-Visual Speech Recognition: A Deep Learning Approach
Automatic speech recognition (ASR) on video data naturally has access to two modalities: audio and video. In previous work, audio-visual ASR, which leverages visual features to help ASR, has been explored on restricted domains of videos. This paper aims to extend this idea to open-domain videos, for example videos uploaded to YouTube. We achieve this by adopting a unified deep learning approach...
Deep Within-Class Covariance Analysis for Acoustic Scene Classification
Within-Class Covariance Normalization (WCCN) is a powerful post-processing method for normalizing the within-class covariance of a set of data points. WCCN projects the observations into a linear sub-space where the within-class variability is reduced. This property has proven to be beneficial in subsequent recognition tasks. The central idea of this paper is to reformulate the classic WCCN as ...
Acoustic scene classification using convolutional neural network and multiple-width frequency-delta data augmentation
In recent years, neural network approaches have shown superior performance to conventional hand-made features in numerous application areas. In particular, convolutional neural networks (ConvNets) exploit spatially local correlations across input data to improve the performance of audio processing tasks, such as speech recognition, musical chord recognition, and onset detection. Here we apply C...
Pairwise Decomposition with Deep Neural Networks and Multiscale Kernel Subspace Learning for Acoustic Scene Classification
We propose a system for acoustic scene classification using pairwise decomposition with deep neural networks and dimensionality reduction by multiscale kernel subspace learning. It is our contribution to the Acoustic Scene Classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2016). The system classifies 15 different acoustic scenes. ...
Publication date: 2016